Many colleges want to maximize the donations they receive from their alumni. To do so, they need to identify and predict the salary and unemployment rate of recent graduates based on their education and other factors. By doing so, they will be able to put more money into the programs that yield the largest return on their investments (students).
Business Question:
Where can colleges put money in order to optimize the amount of money they receive from recent graduates?
Analysis Question:
Based on recent graduates and their characteristics/education, what would be their predicted median salary? Would they make more or less than six figures?
This data is pulled from the 2010-12 American Community Survey Public Use Microdata Series, and is limited to respondents under the age of 28. The general purpose of this code and data is based upon this story. The story describes the dilemma college students face when choosing the right major, weighing the financial benefits of a field against the likelihood of graduating in it. It breaks down overarching major categories like “Engineering” and “STEM,” and dives deeper into what each major means in terms of later financial stability and its popularity compared to other majors. The actual dataset contains a detailed breakdown of basic earnings as well as labor force information, taking into account sex and the type of job acquired post-graduation.
A brief look at the raw data can be found below.
## 'data.frame': 172 obs. of 21 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Major_code : int 2419 2416 2415 2417 2405 2418 6202 5001 2414 2408 ...
## $ Major : chr "PETROLEUM ENGINEERING" "MINING AND MINERAL ENGINEERING" "METALLURGICAL ENGINEERING" "NAVAL ARCHITECTURE AND MARINE ENGINEERING" ...
## $ Total : int 2339 756 856 1258 32260 2573 3777 1792 91227 81527 ...
## $ Men : int 2057 679 725 1123 21239 2200 2110 832 80320 65511 ...
## $ Women : int 282 77 131 135 11021 373 1667 960 10907 16016 ...
## $ Major_category : chr "Engineering" "Engineering" "Engineering" "Engineering" ...
## $ ShareWomen : num 0.121 0.102 0.153 0.107 0.342 ...
## $ Sample_size : int 36 7 3 16 289 17 51 10 1029 631 ...
## $ Employed : int 1976 640 648 758 25694 1857 2912 1526 76442 61928 ...
## $ Full_time : int 1849 556 558 1069 23170 2038 2924 1085 71298 55450 ...
## $ Part_time : int 270 170 133 150 5180 264 296 553 13101 12695 ...
## $ Full_time_year_round: int 1207 388 340 692 16697 1449 2482 827 54639 41413 ...
## $ Unemployed : int 37 85 16 40 1672 400 308 33 4650 3895 ...
## $ Unemployment_rate : num 0.0184 0.1172 0.0241 0.0501 0.0611 ...
## $ Median : int 110000 75000 73000 70000 65000 65000 62000 62000 60000 60000 ...
## $ P25th : int 95000 55000 50000 43000 50000 50000 53000 31500 48000 45000 ...
## $ P75th : int 125000 90000 105000 80000 75000 102000 72000 109000 70000 72000 ...
## $ College_jobs : int 1534 350 456 529 18314 1142 1768 972 52844 45829 ...
## $ Non_college_jobs : int 364 257 176 102 4440 657 314 500 16384 10874 ...
## $ Low_wage_jobs : int 193 50 0 0 972 244 259 220 3253 3170 ...
## - attr(*, "na.action")= 'omit' Named int 22
## ..- attr(*, "names")= chr "22"
As can be seen above, many of the variables are integers. Several of these can be converted into factor variables, while the rest remain numeric. In addition, the variables Rank, Major_code, and Major can be dropped: Rank correlates highly with the salary variable, and the other two are too specific to generalize.
The categories of the categorical variable can also be compressed into broader groups, producing more useful data for the analysis.
In order to do some analysis, all categorical variables need to be one-hot encoded, which is done below:
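A minimal sketch of this step using base R's `model.matrix()` is shown below; the data frame name `majors` and the factor column `Major_category` are assumptions based on the dataset description above.

```r
# Assumed names: `majors` (cleaned data frame), `Major_category` (factor column).
majors$Major_category <- factor(majors$Major_category)

# model.matrix() expands the factor into one 0/1 indicator column per level;
# the "- 1" drops the intercept so every level gets its own column.
dummies <- model.matrix(~ Major_category - 1, data = majors)

# Replace the original factor column with the indicator columns.
majors <- cbind(majors[, names(majors) != "Major_category"], dummies)
```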
Before beginning the analytical part of the exploration, it is beneficial to visualize and summarize the data in order to get a better understanding of it in its entirety, with an emphasis on the variables we believe to be important for the analysis.
The following is the 5 number summary of the median salary variable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 22000 33000 36000 40077 45000 110000
The median of the median salaries is $36,000, with a maximum of $110,000.
## Total Men Women ShareWomen Sample_size Employed Full_time
## Total 1.0000000 0.8780884 0.9447645 0.1429993 0.9455747 0.9962140 0.9893392
## Men 0.8780884 1.0000000 0.6727589 -0.1120136 0.8751756 0.8706047 0.8935631
## Women 0.9447645 0.6727589 1.0000000 0.2978321 0.8626064 0.9440365 0.9176812
## Part_time Full_time_year_round Unemployed Unemployment_rate Median
## Total 0.9502684 0.9811118 0.9747684 0.08319170 -0.1067377
## Men 0.7515917 0.8924540 0.8694115 0.10150234 0.0259906
## Women 0.9545133 0.9057195 0.9116943 0.05910776 -0.1828419
## P25th P75th College_jobs Non_college_jobs Low_wage_jobs
## Total -0.07192608 -0.08319767 0.8004648 0.9412471 0.9355096
## Men 0.03872518 0.05239290 0.5631684 0.8514998 0.7913360
## Women -0.13773826 -0.16452834 0.8519460 0.8721318 0.9044699
The above correlation matrix details the correlation coefficients between all of the respective variables and “Total,” “Men,” and “Women.” The correlation coefficient measures the strength of the relationship between two variables, with a magnitude close to 1 indicating a strong direct (positive) or inverse (negative) relationship. Based on the output, it is important to note the differences in “Employed” between men and women: there is a stronger direct relationship between women and employment (~0.944) than between men and employment (~0.871). Similarly, women are more prone to work part-time (~0.955) than men (~0.752). On the other hand, for the Median variable, which describes the median earnings of full-time year-round workers, women show a slight inverse relationship (~ -0.183) whereas men show a slight direct relationship (~0.026). This is an important dissimilarity: the share of women correlates more strongly with employment, yet women are not paid as much in comparison.
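A sketch of how the correlation matrix above can be produced; the data frame name `majors` is an assumption, and the rows are restricted to the three variables discussed in the text.

```r
# Keep only numeric columns before computing pairwise correlations.
num_cols   <- sapply(majors, is.numeric)
cor_matrix <- cor(majors[, num_cols], use = "complete.obs")

# Show only the rows for the variables discussed above.
round(cor_matrix[c("Total", "Men", "Women"), ], 4)
```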
Now, we can visualize the dataset. To do this, we used the ggplot2 and plotly packages.
As can be seen above, the first graph we created is a polar graph. A polar graph lets the reader see the sampling distribution, as well as how much representation each major category has in the dataset: the larger the slice, the more representation the category has. From the polar chart, Sciences has the largest representation, followed closely by the Other category. STEM is third, trailing by a large margin, and Arts is last.
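One way to build such a chart in ggplot2 is to map a stacked bar onto polar coordinates; this is a sketch, not necessarily the exact code used, and assumes the data frame is named `majors`.

```r
library(ggplot2)

# A stacked single bar becomes a pie/polar chart under coord_polar().
ggplot(majors, aes(x = "", fill = Major_category)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(x = NULL, y = NULL, title = "Major Category Representation")
```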
The next graph we created is a stacked bar graph. The major category is on the x-axis, while the count, normalized to be between 0 and 1, is on the y-axis. The fill is based on whether a major in that category has a median salary above $50,000. From this graph, almost 50 percent of the STEM category makes above 50K per year, the largest percentage of the four major categories. The other three categories are nowhere close to STEM, with Other coming in second at about 7 percent, Sciences third at roughly 1 percent, and Arts last at what appears to be 0 percent making above 50K per year.
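A sketch of this chart is below; `Above_50k` is a hypothetical indicator column created for illustration, and `position = "fill"` is what normalizes each bar to the 0-1 range.

```r
library(ggplot2)

# Hypothetical indicator: does the major's median salary exceed $50,000?
majors$Above_50k <- majors$Median > 50000

ggplot(majors, aes(x = Major_category, fill = Above_50k)) +
  geom_bar(position = "fill") +  # "fill" normalizes each bar to [0, 1]
  labs(y = "Proportion of majors")
```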
For our third graph, we made a box plot with the median salary on the x-axis and the four major categories on the y-axis. From this graph, it can be deduced that STEM majors have a wider range than any other category: the STEM interquartile range spans roughly 40-50K, whereas the other categories span at most 30K. There is one STEM major with a median salary of 110K, the maximum in the dataset and almost double the highest median salary of any other major category. Another interesting aspect of the STEM box plot, compared to the other three, is that its 25th percentile, at about 45K, is higher than the 75th percentile of any other category. The other three box plots are relatively similar to each other, with the Arts box being much narrower than the other two; the narrower the box, the smaller the interquartile range.
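A minimal sketch of the box plot described above, again assuming the data frame is named `majors`:

```r
library(ggplot2)

# Median salary on the x-axis, one box per major category on the y-axis.
ggplot(majors, aes(x = Median, y = Major_category)) +
  geom_boxplot() +
  labs(x = "Median salary", y = "Major category")
```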
Building a base multiple linear regression model:
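The resampling setup implied by the output below (10-fold cross-validation repeated 5 times via the caret package) can be sketched as follows; the training data frame name `train_set` is an assumption.

```r
library(caret)
set.seed(1)

# 10-fold cross-validation, repeated 5 times.
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Fit a linear regression predicting the median salary from all predictors.
lm_fit <- train(Median ~ ., data = train_set, method = "lm", trControl = ctrl)
lm_fit  # prints RMSE, R-squared, and MAE averaged over the resamples
```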
## Linear Regression
##
## 121 samples
## 21 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 5 times)
## Summary of sample sizes: 109, 109, 109, 109, 110, 109, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2968.04 0.9299386 2141.409
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
The linear regression produced an RMSE of 2968.04, an R-squared of 0.9299, and an MAE of 2141.409. The R-squared value is very close to one, which is very good for a linear regression model. RMSE is the root-mean-square error, a standard way to measure a model's error; because our salary values are in the tens of thousands, the resulting RMSE is respectable in this case. R-squared evaluates the scatter of the data points around the fitted regression line: a higher value means the points cluster more tightly around the line. The MAE is the mean absolute error and indicates the average absolute difference between the predicted and observed values. The MAE is a fairly good value in the context of the overarching problem.
##
## Call:
## lm(formula = .outcome ~ ., data = dat, verbose = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11789.8 -1336.5 -29.7 1237.7 8949.7
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.747e+03 2.098e+03 4.170 6.41e-05 ***
## Total 3.257e-02 1.220e-01 0.267 0.78993
## Men -5.538e-02 4.253e-02 -1.302 0.19582
## Women NA NA NA NA
## Major_category_Sciences -3.116e+03 9.623e+02 -3.238 0.00163 **
## Major_category_Arts -2.032e+03 1.241e+03 -1.638 0.10458
## Major_category_Other -3.464e+03 1.041e+03 -3.327 0.00122 **
## Major_category_STEM NA NA NA NA
## ShareWomen -3.739e+03 1.787e+03 -2.093 0.03885 *
## Sample_size -1.672e+00 5.540e+00 -0.302 0.76344
## Employed -1.128e-01 4.068e-01 -0.277 0.78214
## Full_time 2.267e-01 3.795e-01 0.597 0.55169
## Part_time -7.459e-02 4.273e-01 -0.175 0.86176
## Full_time_year_round -8.162e-02 3.125e-01 -0.261 0.79451
## Unemployed -5.541e-04 4.759e-01 -0.001 0.99907
## Unemployment_rate -1.466e+04 1.054e+04 -1.391 0.16720
## P25th 6.389e-01 4.319e-02 14.793 < 2e-16 ***
## P75th 3.401e-01 2.943e-02 11.558 < 2e-16 ***
## College_jobs -3.976e-03 5.460e-02 -0.073 0.94209
## Non_college_jobs -2.613e-02 1.409e-01 -0.185 0.85325
## Low_wage_jobs 1.334e-01 3.772e-01 0.354 0.72427
## High.Unemployment_Low NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2938 on 102 degrees of freedom
## Multiple R-squared: 0.9536, Adjusted R-squared: 0.9454
## F-statistic: 116.4 on 18 and 102 DF, p-value: < 2.2e-16
The resulting coefficients for the multiple linear regression are found above. The y-intercept in this context is the median yearly earnings someone would have if all of the other variables were set to 0, which does not really make sense in the real world. The interpretation of each slope coefficient is the same as in simple linear regression, BUT with all the other variables held fixed. For example, holding all the other variables fixed, every 1 dollar increase in someone's 25th-percentile earnings (P25th) increases their predicted median salary by about 0.64 dollars. The other variables must be held fixed because changing them would change the interpretation.
With the built model, we can now use it to predict.
## 1
## 52904.95
Using random values for the inputs of the multiple linear regression model, we can predict an actual value for the median salary. In this case, the predicted outcome for the following inputs
Total=3000, Men=500, Women=500, Major_category_Sciences=0, Major_category_Arts=0, Major_category_Other=0, Major_category_STEM=1, ShareWomen=0.76, Sample_size=4000, Employed=300, Full_time=234, Part_time=0, Full_time_year_round=100, Unemployed=20, Unemployment_rate=0, P25th=20000, P75th=120000, College_jobs=3343, Non_college_jobs=2, Low_wage_jobs=223, High.Unemployment_Low=0
is 52904.95 dollars.
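A sketch of generating this prediction; `lm_fit` is an assumed name for the fitted model object, and `new_grad` packages the input values listed above into a one-row data frame.

```r
# One-row data frame holding the input values from the text above.
new_grad <- data.frame(
  Total = 3000, Men = 500, Women = 500,
  Major_category_Sciences = 0, Major_category_Arts = 0,
  Major_category_Other = 0, Major_category_STEM = 1,
  ShareWomen = 0.76, Sample_size = 4000, Employed = 300,
  Full_time = 234, Part_time = 0, Full_time_year_round = 100,
  Unemployed = 20, Unemployment_rate = 0, P25th = 20000,
  P75th = 120000, College_jobs = 3343, Non_college_jobs = 2,
  Low_wage_jobs = 223, High.Unemployment_Low = 0
)

# Predicted median salary for this hypothetical graduate cohort.
predict(lm_fit, newdata = new_grad)
```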
## lm variable importance
##
## Overall
## P25th 100.0000
## P75th 78.1243
## Major_category_Other 22.4822
## Major_category_Sciences 21.8797
## ShareWomen 14.1398
## Major_category_Arts 11.0630
## Unemployment_rate 9.3969
## Men 8.7946
## Full_time 4.0294
## Low_wage_jobs 2.3835
## Sample_size 2.0322
## Employed 1.8665
## Total 1.7978
## Full_time_year_round 1.7575
## Non_college_jobs 1.2458
## Part_time 1.1723
## College_jobs 0.4845
## Unemployed 0.0000
For this model, the most important variable was ‘P25th’, with an overall importance of 100% when predicting the median salary variable. Similarly, ‘P75th’ and ‘Major_category_Other’ were also important variables for predicting the median, with overall importances of approximately 78% and 22.5% respectively. These are plausible results, given that one's percentile earnings and chosen major are fairly telling signs of future salary.
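The ranking above can be produced with caret's `varImp()`, which for a linear model scales the absolute t-statistics of the coefficients to a 0-100 range; `lm_fit` is an assumed name for the fitted caret model object.

```r
library(caret)

# Variable importance for the fitted linear regression, scaled to 0-100.
varImp(lm_fit)
```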
First, a combined target variable was created consisting of the median salary, the employment rate (1 - Unemployment Rate), and the percentage of women (Share of Women). The resulting combined target is the Median salary multiplied by the employment rate multiplied by the share of women:
combined_target <- Median * (1 - Unemployment_rate) * ShareWomen
A new data frame was created which combined the original majors data frame with the combined_target variable.
## [1] 0.3953488
The initial base rate for the classifier variable (the combined variable) is approximately 0.3953, representing the percentage of positive entries relative to all entries. The base rate serves as a benchmark when evaluating a classifier: a given model accuracy means much more when the majority class is small. For example, 50% accuracy would be a poor result if the majority class already made up 75% of the entries, since always predicting the majority class would achieve 75% accuracy on its own.
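A sketch of building the combined target and computing the base rate; the data frame name `majors`, the derived column names, and the 20K threshold for the positive class are taken from the surrounding text.

```r
# Combined target: median salary x employment rate x share of women.
combined_target <- majors$Median * (1 - majors$Unemployment_rate) * majors$ShareWomen

# Binary class: positive ("G.20K") when the combined value exceeds 20,000.
majors$combined_class <- factor(ifelse(combined_target > 20000, "G.20K", "LE.EQ.20K"))

# Base rate: proportion of positive entries among all entries.
mean(majors$combined_class == "G.20K")
```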
## [1] 4.472136
The mtry parameter in the random forest algorithm defines the number of variables randomly sampled as candidates at each split. For classification, the default is the square root of the number of predictor variables. As indicated in the output above, the default mtry should be 4, rounded down from 4.47.
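With the 20 predictors assumed here, this default works out to the value shown above:

```r
# Default mtry for classification is floor(sqrt(p)), p = number of predictors.
sqrt(20)         # the 4.47 value shown above
floor(sqrt(20))  # rounded down to 4
```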
## LE.EQ.20K G.20K class.error
## LE.EQ.20K 66 7 0.09589041
## G.20K 12 36 0.25000000
## [1] "Model Accuracy: 84.0572347811362 %"
As indicated in the confusion matrix of the random forest model, the class error for the negative class (“LE.EQ.20K”) is approximately 9.59%, a considerably good value as it is fairly close to 0. The positive class, on the other hand, has a class error of approximately 25.00%. Since the positive class, those with a combined variable over 20K, is the one being predicted, it is important that its error be minimized as much as possible when optimizing the model further. Overall, the model is rather accurate, with an accuracy of approximately 84.057%, largely attributable to the low class error for the negative class. As a model, it is not bad, but it should be properly optimized to lower the positive class error and reduce the number of misclassified positives.
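A sketch of the initial forest; the training data frame `train_rf` and the target column `combined_class` are assumed names based on the confusion matrix above.

```r
library(randomForest)
set.seed(1)

# Initial forest with the default mtry of 4 and 500 trees.
rf_model <- randomForest(combined_class ~ ., data = train_rf,
                         mtry = 4, ntree = 500)

rf_model$confusion  # per-class counts and class.error, as shown above
```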
## Number of Trees Out of the Box <=20K >20K Diff
## 59 59 0.08264463 0.04109589 0.1458333 0.1047374
After building the initial random forest, it is helpful to use the class error rates as well as the OOB error rate to identify the optimal number of trees for tuning the model. To identify such a value, the error rates from the model were converted into a data frame that was then sorted in ascending order on both the OOB column and the positive class (“>20K”) column. The top row therefore contains the number of trees with the minimum OOB error and positive-class error rate for this particular algorithm. When optimizing, it would be beneficial to use 59 as the number of trees when building the random forest model.
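This search can be sketched using the `err.rate` component of the fitted forest, which holds one row per tree with the OOB and per-class error rates; `rf_model` and the class label are assumed names.

```r
# One row per tree: OOB error plus per-class error rates.
err <- as.data.frame(rf_model$err.rate)
err$Trees <- seq_len(nrow(err))

# Sort ascending so the top row has the lowest OOB and positive-class error.
head(err[order(err$OOB, err$G.20K), ], 1)
```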
## mtry = 4 OOB error = 13.22%
## Searching left ...
## mtry = 2 OOB error = 23.97%
## -0.8125 0.05
## Searching right ...
## mtry = 8 OOB error = 11.57%
## 0.125 0.05
## mtry = 16 OOB error = 7.44%
## 0.3571429 0.05
## mtry = 20 OOB error = 8.26%
## -0.1111111 0.05
Similar to identifying the optimal number of trees, the value of mtry associated with the lowest possible OOB error rate is the most beneficial to use when building further random forest models. Based on the output, an mtry of 16 was more optimal than the previous default of 4. The difference in error between mtry values of 16 and 20 is rather small compared to the gap from the smaller values, which is worth noting when optimizing the model in the future.
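The search trace above matches the behavior of `randomForest::tuneRF()`, which doubles (or halves) mtry from a starting value as long as the OOB error improves by a given fraction; `train_x` and `train_y` are assumed names for the predictor and response objects.

```r
library(randomForest)

# Search outward from mtry = 4, doubling each step (stepFactor = 2) and
# continuing while OOB error improves by at least 5% (improve = 0.05).
res <- tuneRF(train_x, train_y, mtryStart = 4, stepFactor = 2,
              improve = 0.05, trace = TRUE)

# The mtry value with the lowest OOB error, as used in the Call below.
best_mtry <- res[which.min(res[, 2]), 1]
```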
##
## Call:
## randomForest(x = x, y = y, mtry = res[which.min(res[, 2]), 1])
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 16
##
## OOB estimate of error rate: 9.09%
## Confusion matrix:
## LE.EQ.20K G.20K class.error
## LE.EQ.20K 66 7 0.09589041
## G.20K 4 44 0.08333333
After applying the model optimizations, the positive class (“G.20K”) error dropped sharply from approximately 25% to 8.33%, while the negative class (“LE.EQ.20K”) error held at approximately 9.59%. Both values are quite good given that they are fairly close to 0. Future adjustments can still be made, but this is a positive improvement to build on.
## LE.EQ.20K G.20K class.error
## LE.EQ.20K 64 9 0.1232877
## G.20K 6 42 0.1250000
## [1] "Model Accuracy: 87.4239150390487 %"
Compared to the initial model, this current random forest model is much more optimal. The class errors for both the positive and negative classes are approximately 12%. There is a slight drawback for the negative class, whose error rose from approximately 9.59%, but this is still favorable because the positive class error decreased by a large amount, and the positive class is the one being predicted and therefore prioritized. Additionally, the accuracy of the model increased slightly from about 84.057% to 87.423%, which is also favorable and attributable to the lower positive-class error. As a model, it still requires some improvement to bring the positive class error as close to 0 as possible.
## [1] "Original Algorithm F1 Score: 81.8181818181818 %"
## [1] "Original Algorithm Accuracy: 84 %"
## [1] "Optimized Algorithm F1 Score: 90.9090909090909 %"
## [1] "Optimized Algorithm Accuracy: 92 %"
In comparison to previous evaluations, this final model shows an increase in both accuracy and F1 score when predicting on the test set. The original model had an F1 score of approximately 81.818% and an accuracy of 84%. After adjusting hyper-parameters and tuning the model, the final evaluation shows a positive increase in both accuracy (92%) and F1 score (90.909%), which is favorable. The increases are not dramatic, which can be attributed to the nature of the data and the limited room left for optimization even with improved parameters. Overall, the model is quite good and has been well optimized, as both the F1 score and the accuracy are fairly close to 100%, which would be the most optimal.
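The F1 calculation can be sketched from test-set confusion counts; the tp/fp/fn/tn values below are hypothetical counts chosen to be consistent with the reported 92% accuracy and 90.9% F1 on a 50-observation test set, not the actual model output.

```r
# Hypothetical test-set counts consistent with the reported scores.
tp <- 20; fp <- 2; fn <- 2; tn <- 26

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / (tp + fp + fn + tn)

c(F1 = f1, Accuracy = accuracy)  # 0.9090909, 0.92
```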
We believe that our model is fair and accounts for the single protected class present in our dataset: women. Our dataset has a variable, ShareWomen, for the percentage of the workforce that women hold. Because of this, the models we create can help assess whether women are being treated fairly in the workplace. For example, our linear regression model indicates that a higher share of women in a major is associated with a lower median salary. If our dataset did not have the ShareWomen variable, our model would not be able to speak to whether women are being paid fairly.
What can you say about the results of the methods section as it relates to your question given the limitations to your model?
Our business question was how should colleges allocate money in order to optimize the amount of donation money they receive from recent graduates. In other words, based on recent graduates and their characteristics and education, what would be their predicted median salary (assuming a higher median salary leads to the student donating more money)? From the data visualization, it can be deduced that the college/university should be putting money into the majors that are under the STEM category. STEM majors have a much higher median salary, a larger range of median salary, and a lower unemployment rate. Thus, they will have more expendable income to potentially donate back to the college as alumni.
From the linear regression model, we can see how the median salary is affected by all of the other variables. For every 1 unit increase in someone's 25th percentile of median salary, their median salary increases by 0.64 units. For the random forest model, a combined target variable was created consisting of the median salary, the employment rate (1 - unemployment rate), and the percentage of women in the workplace (ShareWomen). Using this, we were able to predict values for this target variable and find the variables that impacted it the most.
Overall, based on our analysis, the college/university that asked us to do this analysis should allocate money towards the STEM major category, as STEM majors have the highest median salary and the widest salary range, meaning they would likely donate more money back to their original school. Being a STEM major also lowers the chance that an alum is unemployed or working in a low-wage job.
One additional piece of analysis that would benefit the report as a whole is using more recently recorded data. The data used in this analysis was recorded from 2010-2012, so the trends discovered are likely outdated. New data would greatly benefit the university that wanted this report, as they would be able to adjust allocations across major categories based on newer trends rather than older ones. Another addition that would benefit our report is a decision tree model. Our analysis included linear regression and a random forest, but never a single decision tree, which would have given us an interpretable model that makes the locally optimal split at each step, since decision tree induction is a greedy algorithm by nature. Including the decision tree would have made our analysis more diverse and well-rounded, as we would have performed the analysis using three different major analytic methods. Personally, we do not believe anything limited our analysis on this project: the dataset was easy to work with, and the models we created learned the data efficiently and effectively.